
Conversation

@NuojCheng
Collaborator

@NuojCheng NuojCheng commented Oct 22, 2025

Description

This PR adds support for ramp-up batch size, a feature originally proposed in the GPT-3 paper and implemented in Megatron.

When enabled, the per-device batch size starts at a smaller value (per_device_batch_size_start) and increases by per_device_batch_size_increment until it reaches the target per_device_batch_size over a specified number of samples (global_rampup_samples). This can help improve training stability, especially during the early phases of training.

This feature introduces four new configuration parameters, which align with the Megatron implementation (a schedule sketch follows the list):

  • enable_rampup_batch_size: (default: False) Set to True to enable the ramp-up feature.
  • per_device_batch_size_start: The per-device batch size to use at the beginning of training.
  • per_device_batch_size_increment: The amount to increase the per-device batch size at each ramp-up step.
  • global_rampup_samples: The total number of samples to process before reaching the full target batch size.
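
To make the schedule concrete, here is a minimal sketch of the Megatron-style ramp-up math implied by these parameters; the helper name batch_size_for_samples is hypothetical, and the exact MaxText implementation may differ:

```python
def batch_size_for_samples(
    samples_consumed: int,
    start: int,           # per_device_batch_size_start
    increment: int,       # per_device_batch_size_increment
    target: int,          # per_device_batch_size (final value)
    rampup_samples: int,  # global_rampup_samples
) -> int:
  """Per-device batch size after `samples_consumed` samples."""
  if samples_consumed >= rampup_samples:
    return target
  # The ramp-up window is split evenly across the increment steps, so
  # each intermediate batch size is held for the same number of samples.
  # Assumes (target - start) % increment == 0, which config validation
  # is expected to enforce.
  num_increments = (target - start) // increment
  samples_per_increment = rampup_samples / num_increments
  completed = int(samples_consumed / samples_per_increment)
  return min(start + completed * increment, target)
```

For example, with start=4, increment=4, target=16, and rampup_samples=120, this schedule would use batch sizes 4, 8, and 12 for 40 samples each before settling at 16.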

The PR includes the following changes:

  • RampupDataLoader: Adds a new RampupDataLoader class that inherits from the base DataLoader. Its primary responsibility is to truncate the input data to match the correct ramp-up shape for the current training step (a sketch follows this list).
  • Metric Logger: Updates the metric logger to prevent flops and token counts associated with metadata from being logged.
  • Config Updates: Modifies pyconfig.py to register and validate the new ramp-up configuration parameters.
  • Testing: Adds new tests to data_loader_tests.py to verify the RampupDataLoader's slicing and increment logic.
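
For reference, a minimal sketch of the truncation idea behind RampupDataLoader, reusing the schedule helper above; the class shape, constructor, and iteration protocol here are assumptions, not the PR's exact code:

```python
import jax.tree_util as jtu

class RampUpDataLoader:
  """Illustrative wrapper: truncates the leading (batch) dimension of
  each batch to the current ramp-up size."""

  def __init__(self, base_loader, start, increment, target, rampup_samples):
    self.base_loader = base_loader
    self.start, self.increment = start, increment
    self.target, self.rampup_samples = target, rampup_samples
    # For simplicity this counts per-device samples; a real implementation
    # would track global samples across data-parallel replicas.
    self.samples_consumed = 0

  def __iter__(self):
    for batch in self.base_loader:
      current = batch_size_for_samples(
          self.samples_consumed, self.start, self.increment,
          self.target, self.rampup_samples)
      self.samples_consumed += current
      # Slice every array in the (possibly nested) batch pytree.
      yield jtu.tree_map(lambda x: x[:current], batch)
```

Truncating an already-loaded batch keeps the input pipeline's shapes fixed and confines the ramp-up logic to a thin wrapper, which is also what makes the slicing easy to unit-test.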

FIXES: b/452468482

Tests

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@NuojCheng NuojCheng added the draft (Draft PR) label Oct 22, 2025
@NuojCheng NuojCheng changed the title Add rampup batch size support in MaxText [WIP] Add rampup batch size support in MaxText Oct 22, 2025
@NuojCheng NuojCheng force-pushed the chengnuojin-rampup-batch branch 4 times, most recently from 57ff3e8 to 842193d on October 24, 2025 00:53
@NuojCheng NuojCheng changed the title [WIP] Add rampup batch size support in MaxText Add rampup batch size support in MaxText Oct 24, 2025
@NuojCheng NuojCheng added the gemini-review label and removed the draft (Draft PR) label Oct 24, 2025
@github-actions

🤖 Hi @NuojCheng, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@RissyRan
Collaborator

🤖 Hi @NuojCheng, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

It seems we are out of quota for the free tier. We are going to upgrade to Tier 1; it should be better soon.

Attempt 1 failed with status 429. Retrying with backoff... ApiError 429 (RESOURCE_EXHAUSTED): "You exceeded your current quota, please check your plan and billing details. Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 2. Please retry in 20.529613201s." (quotaId: GenerateRequestsPerMinutePerProjectPerModel-FreeTier; model: gemini-2.5-pro; retryDelay: 20s; see https://ai.google.dev/gemini-api/docs/rate-limits)

@github-actions

🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@NuojCheng NuojCheng force-pushed the chengnuojin-rampup-batch branch 2 times, most recently from 4114fa7 to b911ca8 on October 31, 2025 02:14
@github-actions

🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions bot left a comment


📋 Review Summary

This pull request introduces a batch size ramp-up feature to improve training stability, which is a valuable addition. The implementation is generally clean and follows existing patterns in the codebase. The configuration, data loading, and metric logging components are well-integrated.

🔍 General Feedback

  • The refactoring of the sharding logic out of the DataLoader and into the training loop is a good improvement for separation of concerns.
  • The use of a factory function create_dataloader is a clean way to handle the conditional creation of the RampUpDataLoader (a sketch of that wiring follows this list).
  • I've left a couple of minor suggestions to improve an assertion message and to correct the logic in a unit test.
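
For illustration, a minimal sketch of that factory wiring; the config field names follow the PR description, and everything else is an assumption:

```python
def create_dataloader(config, base_loader):
  """Return the ramp-up wrapper only when the feature is enabled."""
  if not config.enable_rampup_batch_size:
    return base_loader
  return RampUpDataLoader(
      base_loader,
      start=config.per_device_batch_size_start,
      increment=config.per_device_batch_size_increment,
      target=config.per_device_batch_size,
      rampup_samples=config.global_rampup_samples,
  )
```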

Collaborator

@RissyRan RissyRan left a comment


Thanks Nuojing! LGTM! If you could respond to Gemini's comments, that would be great! We can chat more about future deliveries offline, but that's not a blocker for this PR.

@NuojCheng NuojCheng force-pushed the chengnuojin-rampup-batch branch 3 times, most recently from 24e0d29 to 5e7f164 on October 31, 2025 20:07
Collaborator

@RissyRan RissyRan left a comment


LGTM for this PR! Let's chat more about the remaining tasks in the meeting today. Thank you!

@NuojCheng NuojCheng force-pushed the chengnuojin-rampup-batch branch 2 times, most recently from 47a1028 to d3fed21 on November 1, 2025 00:09
@NuojCheng NuojCheng force-pushed the chengnuojin-rampup-batch branch from d3fed21 to 40e056d on November 1, 2025 00:12
@RissyRan
Collaborator

RissyRan commented Nov 2, 2025

We discussed offline and agreed to merge this feature first, leaving two action items for follow-up: 1) learning-rate adjustment, if needed, due to this feature; 2) metrics reporting (e.g., throughput) during the batch ramp-up.

@NuojCheng NuojCheng force-pushed the chengnuojin-rampup-batch branch 3 times, most recently from 3375760 to 9143004 on November 2, 2025 06:02
copybara-service bot pushed a commit that referenced this pull request Nov 2, 2025
--
40e056d by NuojCheng <[email protected]>:

add rampup batch size

COPYBARA_INTEGRATE_REVIEW=#2535 from AI-Hypercomputer:chengnuojin-rampup-batch 40e056d
PiperOrigin-RevId: 827037473
gulsumgudukbay pushed a commit to ROCm/maxtext that referenced this pull request Nov 8, 2025
--
40e056d by NuojCheng <[email protected]>:

add rampup batch size

COPYBARA_INTEGRATE_REVIEW=AI-Hypercomputer#2535 from AI-Hypercomputer:chengnuojin-rampup-batch 40e056d
PiperOrigin-RevId: 827037473
gulsumgudukbay pushed a commit to ROCm/maxtext that referenced this pull request Nov 10, 2025
--
40e056d by NuojCheng <[email protected]>:

add rampup batch size

COPYBARA_INTEGRATE_REVIEW=AI-Hypercomputer#2535 from AI-Hypercomputer:chengnuojin-rampup-batch 40e056d
PiperOrigin-RevId: 827037473
@NuojCheng NuojCheng force-pushed the chengnuojin-rampup-batch branch from 9143004 to 3da3587 on November 11, 2025 01:45
@NuojCheng NuojCheng closed this Nov 11, 2025
melissawm pushed a commit to melissawm/maxtext that referenced this pull request Nov 18, 2025
--
40e056d by NuojCheng <[email protected]>:

add rampup batch size

COPYBARA_INTEGRATE_REVIEW=AI-Hypercomputer#2535 from AI-Hypercomputer:chengnuojin-rampup-batch 40e056d
PiperOrigin-RevId: 827037473